Multilingual Sentence Categorization According to Language

Author

  • Emmanuel Giguet
Abstract

Issues in sentence categorization according to language are fundamental to NLP, especially in document processing. With the growing amount of multilingual text corpus data becoming available, sentence categorization, which leads to multilingual text structure, opens a wide range of applications in multilingual text analysis, such as information retrieval or preprocessing for a multilingual syntactic parser. The major difficulties in sentence categorization are convergence and textual errors: convergence, because dealing with short entries involves discarding languages on the basis of few clues; textual errors, because documents coming from different electronic sources may contain spelling and grammatical errors as well as character recognition errors generated by OCR. We describe here an approach to sentence categorization whose originality is that it is based on natural properties of languages, with no dependency on a training set. The implementation is fast, small, robust and tolerant of textual errors. Tested on French, English, Spanish and German discrimination, the system gives very interesting results, achieving in one test 99.4% correct assignments on real sentences. The resolution power is based on grammatical words (not the most common words) and the alphabet. Having the grammatical words and the alphabet of each language at its disposal, the system computes for each language its likelihood of being selected. The name of the language with the optimum likelihood tags the sentence, but unresolved ambiguities are maintained. We discuss the reasons that led us to use these linguistic facts and present several directions for improving the system's classification performance.

This paper is published in the Proceedings of the European Chapter of the Association for Computational Linguistics SIGDAT Workshop "From Text to Tags: Issues in Multilingual Language Analysis", held in March 1995 in Dublin.
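The scoring idea described in the abstract (grammatical words plus alphabet, with unresolved ambiguities kept) can be illustrated with a minimal Python sketch. This is not the author's implementation: the word lists, the letter sets, the additive score, and the function names `score_sentence` and `categorize` are all hypothetical, chosen only to show the general shape of such a system. In particular, using distinctive letters as positive evidence is an assumption; the abstract does not say how the alphabet is used.

```python
import re

# Hypothetical, deliberately tiny resources (assumptions for illustration;
# the abstract does not list the actual grammatical words or alphabets).
GRAMMATICAL_WORDS = {
    "french": {"le", "la", "les", "des", "est", "et", "dans", "un", "une", "que"},
    "english": {"the", "of", "is", "and", "in", "a", "that", "to", "with"},
    "spanish": {"el", "la", "los", "las", "es", "y", "en", "un", "una", "que"},
    "german": {"der", "die", "das", "ist", "und", "in", "ein", "eine", "mit"},
}
DISTINCTIVE_LETTERS = {
    "french": set("àâçèéêëîïôùûœ"),
    "english": set(),
    "spanish": set("áéíñóúü¿¡"),
    "german": set("äöüß"),
}


def score_sentence(sentence):
    """Return a crude per-language score for one sentence.

    Assumed scoring: one point per grammatical-word token found, plus one
    point per distinct language-specific character seen in the sentence.
    """
    lowered = sentence.lower()
    words = re.findall(r"\w+", lowered)
    chars = set(lowered)
    scores = {}
    for lang, grammatical_words in GRAMMATICAL_WORDS.items():
        word_hits = sum(1 for w in words if w in grammatical_words)
        letter_hits = len(chars & DISTINCTIVE_LETTERS[lang])
        scores[lang] = word_hits + letter_hits
    return scores


def categorize(sentence):
    """Return every language tied for the best score (ambiguity is kept).

    An empty list means the sentence offered no clue at all.
    """
    scores = score_sentence(sentence)
    best = max(scores.values())
    if best == 0:
        return []
    return sorted(lang for lang, s in scores.items() if s == best)


print(categorize("Le chat est dans le jardin"))
print(categorize("la casa"))
```

Returning a list rather than a single label mirrors the abstract's point that unresolved ambiguities are maintained: a short entry such as "la casa" matches a grammatical word in both the hypothetical French and Spanish lists, so both languages are kept.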
Categorizing sentences with linguistic properties shows that difficult problems sometimes have simple solutions.

Text categorization according to language emerged with the need to process texts coming from all over the world. The goal of text categorization is to tag texts with the name of the language in which they are written. Information retrieval is the main application field. The traditional way to do this is to exploit the differences between letter combinations in different languages (Cavnar and Trenkle, 1994). For each language, the system computes from a training set a profile based on the frequency (or probability) of letter sequences. Then, for a given text, it computes a profile and selects the language that has the closest profile. …
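The traditional profile-based approach summarized above can be sketched in Python as follows. This is a simplified illustration, not a faithful reimplementation of any published system: the trigram length, the `top` cutoff, the rank-difference distance, the penalty for unseen sequences, and the toy training sentences are all assumptions, and the function names are hypothetical.

```python
from collections import Counter


def ngram_profile(text, n=3, top=300):
    """Map each of the `top` most frequent character n-grams to its rank."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    # Sort by descending frequency; break ties alphabetically so that
    # the ranking is deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:top]
    return {gram: rank for rank, gram in enumerate(ranked)}


def profile_distance(doc_profile, lang_profile):
    """Sum of rank differences; n-grams unseen in the language profile
    get a fixed penalty equal to that profile's size (an assumption)."""
    penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else penalty
        for gram, rank in doc_profile.items()
    )


def closest_language(text, lang_profiles, n=3):
    """Return the language whose profile is closest to the text's profile."""
    doc_profile = ngram_profile(text, n)
    return min(
        lang_profiles,
        key=lambda lang: profile_distance(doc_profile, lang_profiles[lang]),
    )


# Toy training data (hypothetical; a real system would use a large corpus).
TRAINING = {
    "english": "the quick brown fox jumps over the lazy dog and the cat is in the house",
    "french": "le renard brun saute par dessus le chien paresseux et le chat est dans la maison",
}
PROFILES = {lang: ngram_profile(text) for lang, text in TRAINING.items()}

print(closest_language("the dog is in the house", PROFILES))
```

The sketch makes the training-set dependency visible: every language profile is derived from sample text, which is the dependency the approach described in the abstract aims to avoid.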

Similar resources

Multilingual Sentence Categorization according to Language

Issues in sentence categorization according to language are fundamental to NLP, especially in document processing. With the growing amount of multilingual text corpus data becoming available, sentence categorization, which leads to multilingual text structure, opens a wide range of applications in multilingual text analysis, such as information retrieval or preprocessing of multilingual sy...

Dynamic Categorization of Semantics of Fashion Language: A Memetic Approach

Categories are not invariant. This paper attempts to explore the dynamic nature of semantic category, in particular, that of fashion language, based on the cognitive theory of Dawkins’ memetics, a new theory of cultural evolution. Semantic attributes of linguistic memes decrease or proliferate in replication and spreading, which involves a dynamic development of semantic category. More specific...

Exploiting Comparable Corpora and Bilingual Dictionaries for Cross-Language Text Categorization

Cross-language Text Categorization is the task of assigning semantic classes to documents written in a target language (e.g. English) while the system is trained using labeled documents in a source language (e.g. Italian). In this work we present many solutions according to the availability of bilingual resources, and we show that it is possible to deal with the problem even when no such resour...

Cross Language Text Categorization by Acquiring Multilingual Domain Models from Comparable Corpora

In a multilingual scenario, the classical monolingual text categorization problem can be reformulated as a cross language TC task, in which we have to cope with two or more languages (e.g. English and Italian). In this setting, the system is trained using labeled examples in a source language (e.g. English), and it classifies documents in a different target language (e.g. Italian). In this pape...

The Use of WordNets for Multilingual Text Categorization: A Comparative Study

The successful use of the Princeton WordNet for Text Categorization has prompted the creation of similar WordNets in other languages as well. This paper focuses on a comparative study between two WordNet-based approaches for Multilingual Text Categorization. The first relies on using machine translation to access the Princeton WordNet directly, while the second avoids machine translation by usi...

Publication date: 2007